This document contains instructions on Project 2 for STA 141A in Winter 2021. This document is made with R markdown. The rmd file to generate this document is available on the course website.
We will do an EDA for the WHO COVID-19 data. You can take a look at the weekly WHO COVID-19 update for reference.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
covid <- read_csv("https://covid19.who.int/WHO-COVID-19-global-data.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Date_reported = col_date(format = ""),
## Country_code = col_character(),
## Country = col_character(),
## WHO_region = col_character(),
## New_cases = col_double(),
## Cumulative_cases = col_double(),
## New_deaths = col_double(),
## Cumulative_deaths = col_double()
## )
This data set is maintained by WHO and updated constantly. The first task for you is to understand this data set, e.g., the meaning of the variables and their values. To this end, you can make use of your preferred search engine, or read the documentation for this data set.
In this project, you are a team of conscientious statisticians, who wish to help the general public understand the ongoing pandemic.
The following list provides one potential structure of the data analysis report. As this is the final project, the following suggestions are intended to provide one viable route for your project while leaving you as much freedom as possible.
Before writing your analysis report, you may want to explore this data set and read about the coronavirus to generate the hypothesis or question to be answered in this report, i.e., the question(s) of interest. You can be creative on this question so long as it meets three conditions.
In this Exploratory Data Analysis of the World Health Organization (WHO) COVID-19 database, we determine the impact of COVID-19 on global economic health. We consider many factors in this analysis such as unemployment rates, energy prices, Gross Domestic Product (GDP), and other such criteria to get a sense of how economies progress as the disease travels through the population.
In early 2020, an outbreak of coronavirus disease 2019 (COVID-19) spread from China, affecting people all over the world. The initial outbreak led many countries to go into lockdown almost immediately, and citizens were required to stay at home to contain the spread of the virus. As a result, countries produced fewer goods and GDP fell. The purpose of this analysis is to visualize how global economies responded to COVID-19 over time. Furthermore, if a similar pandemic occurs in the future, this analysis may help predict how each country will be affected and inform responses that limit the damage to the economy.
We first analyze the WHO data set to get a sense of the trends per fiscal quarter, which we define as a three-month period, with the first quarter starting on January 3, 2020 (so the subsequent quarters start three months apart, also on the 3rd). We take the data set from the WHO and keep only the dates from the past year, that is, the dates between 1/3/2020 and 12/31/2020. We then overwrite the Date_reported column so that it holds the start date of the fiscal quarter containing each observation. The first rows of the resulting data set are shown below:
head(covid)
## # A tibble: 6 x 8
## Date_reported Country_code Country WHO_region New_cases Cumulative_cases
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1/3/2020 AR Argent~ AMRO 0 0
## 2 1/3/2020 AR Argent~ AMRO 0 0
## 3 1/3/2020 AR Argent~ AMRO 0 0
## 4 1/3/2020 AR Argent~ AMRO 0 0
## 5 1/3/2020 AR Argent~ AMRO 0 0
## 6 1/3/2020 AR Argent~ AMRO 0 0
## # ... with 2 more variables: New_deaths <dbl>, Cumulative_deaths <dbl>
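The filtering and quarter-labelling step described above is not shown in code. A minimal sketch of one way to do it in base R follows, assuming quarter starts on the 3rd of January, April, July and October 2020. The toy data frame and the column name `Quarter_start` are hypothetical and chosen only for illustration; the report itself stores the label in Date_reported.

```r
# Toy stand-in for the WHO data: a few reporting dates for one country.
toy <- data.frame(
  Date_reported = as.Date(c("2020-01-03", "2020-04-02", "2020-04-03",
                            "2020-10-02", "2020-12-31", "2021-01-05")),
  New_cases = c(0, 5, 7, 11, 13, 17)
)

# Assumed quarter starts: the 3rd of January, April, July and October 2020.
quarter_starts <- as.Date(c("2020-01-03", "2020-04-03",
                            "2020-07-03", "2020-10-03"))

# Keep only the dates from 2020-01-03 through 2020-12-31.
toy <- toy[toy$Date_reported >= as.Date("2020-01-03") &
             toy$Date_reported <= as.Date("2020-12-31"), ]

# findInterval() gives, for each date, the index of the latest
# quarter start that falls on or before it.
idx <- findInterval(as.numeric(toy$Date_reported), as.numeric(quarter_starts))
toy$Quarter_start <- quarter_starts[idx]
```

Keeping the label as a true Date rather than text such as "10/3/2020" means that sorting by it puts the quarters in chronological order.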
We wish to see the average number of new cases and deaths per day from COVID-19 in each quarter, so we group by country and quarter. We are also interested in the percent change in cases and deaths relative to the preceding quarter. The code below computes these percent changes from the summary table Ave_Summary.
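The construction of Ave_Summary itself does not appear in this report. A minimal sketch of the grouping step, assuming it parallels the Cumul_Summary code later in the report with `mean()` in place of `sum()`, might look like the following. The toy data and the name `Ave_Summary_toy` are hypothetical.

```r
library(dplyr)

# Toy daily records, already labelled with the start of their quarter.
toy <- tibble(
  Country       = c("A", "A", "A", "B", "B"),
  Date_reported = c("1/3/2020", "1/3/2020", "4/3/2020", "1/3/2020", "1/3/2020"),
  New_cases     = c(2, 4, 9, 0, 10),
  New_deaths    = c(0, 1, 3, 0, 2)
)

# Average daily new cases and deaths for each country and quarter.
Ave_Summary_toy <- toy %>%
  group_by(Country, Date_reported) %>%
  summarize(
    Ave_NewCase  = mean(New_cases),
    Ave_NewDeath = mean(New_deaths),
    .groups = "drop"
  )
```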
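# Relative change between consecutive quarters, stored as a fraction (not multiplied by 100).
# The indexing treats the table as 51 blocks of four consecutive rows
# (one block per country, one row per quarter, in chronological order).
# abs() keeps only the size of the change, so increases and decreases are not distinguished.
Ave_Summary$CasePercent_Change = NA
Ave_Summary$DeathPercent_Change = NA
for(i in c(0:50)){
  for(j in c(1:3)){
    Ave_Summary$CasePercent_Change[4*i+j+1] = abs(Ave_Summary$Ave_NewCase[4*i+j] - Ave_Summary$Ave_NewCase[4*i+j + 1])/Ave_Summary$Ave_NewCase[4*i+j]
    Ave_Summary$DeathPercent_Change[4*i+j+1] = abs(Ave_Summary$Ave_NewDeath[4*i+j] - Ave_Summary$Ave_NewDeath[4*i+j + 1])/Ave_Summary$Ave_NewDeath[4*i+j]
  }
}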
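The loop above addresses rows by position, so it depends on every country having exactly four rows stored in chronological order. As an alternative sketch (not the method used in this report), the same quantity can be computed per country with `dplyr::lag()`, which does not depend on row positions. This version keeps the sign of the change and returns NA when the preceding quarter is zero; the toy table and the name `toy_change` are hypothetical.

```r
library(dplyr)

# Toy quarterly averages for two countries.
toy_summary <- tibble(
  Country       = rep(c("A", "B"), each = 3),
  Date_reported = rep(as.Date(c("2020-01-03", "2020-04-03", "2020-07-03")),
                      times = 2),
  Ave_NewCase   = c(10, 15, 12, 0, 4, 8)
)

toy_change <- toy_summary %>%
  arrange(Country, Date_reported) %>%   # chronological order within country
  group_by(Country) %>%
  mutate(
    prev = lag(Ave_NewCase),            # value in the preceding quarter
    # Signed relative change; NA for the first quarter or a zero baseline.
    CasePercent_Change = if_else(!is.na(prev) & prev > 0,
                                 (Ave_NewCase - prev) / prev,
                                 NA_real_)
  ) %>%
  ungroup() %>%
  select(-prev)
```

Wrapping the expression in `abs()` would reproduce the unsigned values computed by the loop.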
We now plot these series over time. Note that we plot the logarithm of the average numbers of deaths and cases, and the percent changes on their original scale.
library(ggpubr)  # assumed source of ggarrange(); it is not loaded in the code shown
# One line per country; `group` (not `by`) is the ggplot2 aesthetic for this.
figure1 = Ave_Summary %>%
  ggplot(aes(x = as.Date(Date_reported), y = log(Ave_NewDeath), group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
figure2 = Ave_Summary %>%
  ggplot(aes(x = as.Date(Date_reported), y = log(Ave_NewCase), group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
figure3 = Ave_Summary %>%
  ggplot(aes(x = as.Date(Date_reported), y = CasePercent_Change, group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
figure4 = Ave_Summary %>%
  ggplot(aes(x = as.Date(Date_reported), y = DeathPercent_Change, group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
ggarrange(figure1, figure2, figure3, figure4)
To better visualize how the disease progressed in these countries, the interactive plot below shows average new deaths against average new cases on the raw scale. Moving the slider steps through the quarters.
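library(plotly)  # provides plot_ly(); it is not loaded in the code shown
Ave_Summary %>% plot_ly(
  x = ~Ave_NewCase,
  y = ~Ave_NewDeath,
  frame = ~Date_reported,   # one animation frame per quarter
  text = ~Country,
  hoverinfo = "text",       # show the country name on hover
  color = ~Country,
  type = 'scatter',
  mode = 'markers',
  showlegend = TRUE
)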
We showed earlier the graphs for the average number of cases and deaths. Another way to visualize the data is to look at the total number of cases and deaths within each fiscal quarter, which the code below stores as Cumul_Case and Cumul_Death.
Cumul_Summary = covid %>%
  group_by(Country, Date_reported) %>%
  summarize(Cumul_Case = sum(New_cases),
            Cumul_Death = sum(New_deaths))
## `summarise()` regrouping output by 'Country' (override with `.groups` argument)
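# Relative change between consecutive quarters, stored as a fraction (not multiplied by 100).
# The indexing treats the table as 51 blocks of four consecutive rows
# (one block per country, one row per quarter, in chronological order).
# abs() keeps only the size of the change, so increases and decreases are not distinguished.
Cumul_Summary$CasePercent_Change = NA
Cumul_Summary$DeathPercent_Change = NA
for(i in c(0:50)){
  for(j in c(1:3)){
    Cumul_Summary$CasePercent_Change[4*i+j+1] = abs(Cumul_Summary$Cumul_Case[4*i+j] - Cumul_Summary$Cumul_Case[4*i+j + 1])/Cumul_Summary$Cumul_Case[4*i+j]
    Cumul_Summary$DeathPercent_Change[4*i+j+1] = abs(Cumul_Summary$Cumul_Death[4*i+j] - Cumul_Summary$Cumul_Death[4*i+j + 1])/Cumul_Summary$Cumul_Death[4*i+j]
  }
}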
We now plot these series over time, as we did for the averages.
# One line per country; `group` (not `by`) is the ggplot2 aesthetic for this.
figure5 = Cumul_Summary %>%
  ggplot(aes(x = as.Date(Date_reported), y = log(Cumul_Case), group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
# Note: deaths are plotted on the raw scale here, unlike cases in figure5.
figure6 = Cumul_Summary %>%
  ggplot(aes(x = as.Date(Date_reported), y = (Cumul_Death), group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
figure7 = Cumul_Summary %>%
  ggplot(aes(x = as.Date(Date_reported), y = CasePercent_Change, group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
figure8 = Cumul_Summary %>%
  ggplot(aes(x = as.Date(Date_reported), y = DeathPercent_Change, group = Country)) +
  geom_line(aes(color = Country)) +
  theme(legend.position = 'none')
ggarrange(figure5, figure6, figure7, figure8)
The interactive plot below is analogous to the earlier one, but shows quarterly total deaths against quarterly total cases.
Cumul_Summary %>%
  plot_ly(
    x = ~Cumul_Case,
    y = ~Cumul_Death,
    frame = ~Date_reported,   # one animation frame per quarter
    text = ~Country,
    hoverinfo = "text",       # show the country name on hover
    color = ~Country,
    type = 'scatter',
    mode = 'markers',
    showlegend = TRUE
  )
Notice how the time series for average cases and total cases are similar, and likewise for average deaths and total deaths. This is expected: the quarterly average is the quarterly total divided by the number of daily reports in the quarter, so the two series are proportional, and on a log scale they differ only by a nearly constant vertical shift.
Another way to visualize these data is with boxplots of the log-transformed values, shown below:
figure9 = ggplot(Ave_Summary, aes(x = Date_reported, y = log(Ave_NewCase))) + geom_boxplot()
figure10 = ggplot(Ave_Summary, aes(x = Date_reported, y = log(Ave_NewDeath))) + geom_boxplot()
figure11 = ggplot(Cumul_Summary, aes(x = Date_reported, y = log(Cumul_Case))) + geom_boxplot()
figure12 = ggplot(Cumul_Summary, aes(x = Date_reported, y = log(Cumul_Death))) + geom_boxplot()
ggarrange(figure9, figure10, figure11, figure12)
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
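The warnings above are consistent with quarters in which the average or total is zero, since `log(0)` is `-Inf` in R and ggplot2 drops non-finite values before drawing. A minimal sketch of two common ways to handle this, using hypothetical values rather than the WHO data:

```r
# Hypothetical quarterly averages, including a quarter with no reported cases.
quarterly_avg <- c(0, 0.5, 12, 340)

# log(0) is -Inf, which is what triggers the "non-finite values" warning.
logged <- log(quarterly_avg)
n_dropped <- sum(!is.finite(logged))

# Option 1: drop the non-finite values explicitly before plotting.
finite_only <- logged[is.finite(logged)]

# Option 2: use log1p(x) = log(1 + x), which maps 0 to 0 and keeps every row.
shifted <- log1p(quarterly_avg)
```

The two options answer slightly different questions: dropping zeros summarizes only quarters with reported cases, while `log1p()` keeps all quarters but compresses small values differently from a plain logarithm.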
Propose an appropriate model to answer the questions of interest.
Fit the proposed model in (4) and explain your results.
Conduct model diagnostics and/or sensitivity analysis.
Conclude your analysis with a discussion of your findings and caveats of your approach.